
Programmatic Dependent Launch (PDL) for more performance on newer NVIDIA GPUs (Hopper+) #22522

Draft

aendk wants to merge 31 commits into ggml-org:master from aendk:akieslinger/pdl-cuda-lc-experiments

Conversation

@aendk (Contributor) commented Apr 29, 2026

Overview

Programmatic Dependent Launch (PDL) is a CUDA optimization for newer NVIDIA GPUs (compute capability >= 9.0, i.e. Hopper and later; this does not include Ada Lovelace).
It enables overlapping execution of CUDA kernels within the same CUDA stream. Like CUDA graphs, it reduces kernel launch overhead on the device, and the benefits of both are additive (PDL + CG > CG > PDL).
This can best be seen visually in this Nsight Systems screenshot of a single CUDA stream; kernels which should normally be strictly ordered are run concurrently:
[Screenshot: Nsight Systems timeline showing overlapping kernels on a single CUDA stream]

PDL was already proposed last year in #15479.
This PR integrates better with the CUDA graph semantics and has vastly better performance. On an RTX PRO 6000, a token-generation speedup of 10% is not unusual; on DGX Spark, I've seen 4-5% improvements (model dependent, see detailed stats below).

For full PDL performance, kernels need to be equipped with two new features: a synchronization barrier (GGML_CUDA_PDL_SYNC) and a launch signal (GGML_CUDA_PDL_LC). The synchronization barrier makes the kernel wait for the data written by the preceding kernel, so that no race conditions or premature data accesses take place. The launch signal marks the point at which the current kernel can tolerate the next kernel starting alongside it. Additionally, kernels need to be launched via the new ggml_cuda_kernel_launch() function. A hedged device-side sketch is shown below.
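To make the mechanics concrete, here is a minimal device-side sketch. The kernel and its arguments are made up for illustration only; just the two macros (which wrap cudaGridDependencySynchronize() and cudaTriggerProgrammaticLaunchCompletion()) are from this PR:

```cpp
// Sketch: a hypothetical elementwise kernel enrolled into PDL (not a kernel from this PR).
__global__ void scale_f32_pdl(const float * x, float * dst, const float scale, const int64_t n) {
    // Index arithmetic only, no access to x/dst yet, so this part may run while
    // the preceding kernel of the stream is still finishing.
    const int64_t i = (int64_t) blockIdx.x * blockDim.x + threadIdx.x;

    GGML_CUDA_PDL_SYNC(); // barrier: wait for the preceding kernel's writes, placed right before the first real read of x

    GGML_CUDA_PDL_LC();   // launch signal: from here on, the next kernel in the stream may start launching.
                          // Placement is a tuning knob (see the heuristics below); correctness is
                          // guaranteed by the *next* kernel's own GGML_CUDA_PDL_SYNC().

    if (i < n) {
        dst[i] = scale * x[i];
    }
}
```

On the host side, such a kernel is then launched via ggml_cuda_kernel_launch() instead of a plain <<<...>>> launch.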

The synchronization barrier can be placed by carefully inspecting the kernel code and identifying the kernel's first "real" access to its input data (i.e. excluding pointer arithmetic). The launch signal placement requires a bit of hand-tuning and benchmarking. In this draft PR, I enrolled all kernels used in gpt-oss 20b, qwen3.5 and nemotron 120B Super. Because these kernels are shared across models, I tested more models as well: I saw speed-ups in almost all of them during token generation, with prefill/context phases being mostly neutral.

Applied Heuristics:

  • In this draft, for the synchronization barrier placement, I assumed that the first "real" data access of each kernel is a read of an input tensor. If there are cases where a preceding kernel outputs a scalar and the current kernel reads this scalar before GGML_CUDA_PDL_SYNC, a data race could occur. Before marking this merge-ready, I will double-check this; please keep it in mind when reviewing.
  • Correct placement of GGML_CUDA_PDL_LC is a bit of trial and error. This is visible in some kernels, where suboptimal placements are still commented out from earlier commits. In some kernels, placing GGML_CUDA_PDL_LC is even performance-negative (most notably mul_mat_vec_q). Generally, the earlier the signal can be placed, the more latency-limited the kernel is and the more shared-resource contention (caused by the premature launch of the successor kernel) it can tolerate.

Further Info on this Implementation

  • This approach can be used even if some kernels in the graph are not enrolled into PDL. Whenever two successive kernels are enrolled, they leverage PDL (e.g. quantize_q8 and mul_mat_vec_q are both enrolled and appear in many models).
  • Kernels can be enrolled one-by-one (a plain-CUDA sketch of the per-launch opt-in follows this list).
  • Optimizing the placement of GGML_CUDA_PDL_LC is a bit of trial and error, but a placement that works well for one model appears to benefit other models too. In internal testing, I did not run into a placement that, for example, helped model A but hurt model B.
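For reference, the per-launch opt-in that ggml_cuda_kernel_launch() presumably performs corresponds to the standard CUDA 11.8+ launch attribute. The helper below is a hypothetical stand-alone sketch, not the signature used in this PR; non-enrolled kernels simply keep their ordinary launches, which is why mixing enrolled and non-enrolled kernels works:

```cpp
#include <cuda_runtime.h>

// Sketch: launching a kernel with PDL enabled via the plain CUDA runtime API.
template <typename... Args>
cudaError_t launch_with_pdl(void (*kernel)(Args...), dim3 grid, dim3 block,
                            size_t smem, cudaStream_t stream, Args... args) {
    cudaLaunchAttribute attr = {};
    attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
    attr.val.programmaticStreamSerializationAllowed = 1; // allow the next kernel in the stream to overlap

    cudaLaunchConfig_t cfg = {};
    cfg.gridDim          = grid;
    cfg.blockDim         = block;
    cfg.dynamicSmemBytes = smem;
    cfg.stream           = stream;
    cfg.attrs            = &attr;
    cfg.numAttrs         = 1;

    return cudaLaunchKernelEx(&cfg, kernel, args...);
}
```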

Known issues/TODOs

  • Currently, there is no tooling (like memcheck) to identify a race condition caused by an incorrectly placed GGML_CUDA_PDL_SYNC.
  • Need to find a way to automatically disable PDL for unsupported (NVIDIA) GPUs. A simple check on GGML_CUDA_CC_HOPPER did not work. (A possible runtime compute-capability check is sketched after this list.)
  • More kernels can be moved to PDL (different launch + sync barrier).
  • Need to remove commented out launch signal experimentation.
  • Like for CUDA graphs themselves, it might make sense to roll this feature out for token generation only at first. Need to check if that is feasible.
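Regarding the unsupported-GPU TODO, a runtime compute-capability check is one possible direction. This is a hypothetical sketch, not something this branch already contains; since Ada Lovelace is CC 8.9, a major-version check of >= 9 excludes it:

```cpp
#include <cuda_runtime.h>

// Hypothetical runtime gate (sketch): only enable PDL on devices with
// compute capability >= 9.0; anything else falls back to plain launches.
static bool pdl_supported(int device) {
    int major = 0;
    if (cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, device) != cudaSuccess) {
        return false;
    }
    return major >= 9;
}
```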

How to test it

You need a newer NVIDIA GPU (Hopper or newer, e.g. Blackwell), and you need to compile with -D GGML_CUDA_PDL=ON (in addition to the usual CUDA build flags).

How to enroll other kernels into PDL

  • Step 1: Modify the kernel launch to go through ggml_cuda_kernel_launch() and place GGML_CUDA_PDL_SYNC(). Modifying the kernel launch without setting the sync barrier leads to a race condition.
  • Step 2: Iterate on the placement of GGML_CUDA_PDL_LC(). My loose heuristic was to place it at the function start, measure performance, and then repeat the process for different locations in the middle of the kernel, picking the best-performing placement. In my testing, placing it near the bottom of a kernel was almost always unproductive. A hypothetical example of this tuning loop follows below.
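As an illustration of the tuning loop, here is a made-up kernel with the candidate placements I would benchmark one at a time (only the macros are real; the kernel itself is hypothetical and not part of this PR):

```cpp
// Sketch: exactly one GGML_CUDA_PDL_LC() is active per experiment; the others stay
// commented out while benchmarking, and the fastest placement wins.
__global__ void row_sumsq(const float * x, float * dst, const int ncols, const int nrows) {
    GGML_CUDA_PDL_SYNC();   // step 1: wait for the producer kernel (required once launched via ggml_cuda_kernel_launch())
    // GGML_CUDA_PDL_LC();  // candidate A: function start (maximum overlap, maximum contention)

    const int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= nrows) {
        return;
    }

    float sum = 0.0f;
    for (int col = 0; col < ncols; ++col) {
        const float v = x[(int64_t) row * ncols + col];
        sum += v * v;
    }
    GGML_CUDA_PDL_LC();     // candidate B: after the last read of x

    dst[row] = sum;
    // GGML_CUDA_PDL_LC();  // candidate C: after the final write (rarely helped in my testing)
}
```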

Let me know if you are able to test it! @ggerganov @JohannesGaessler @am17an @ORippler

Performance:

RTX PRO 6000
| Model                              | Test   |   t/s master |   t/s akieslinger/pdl-cuda-lc-experiments |   Speedup |
|:-----------------------------------|:-------|-------------:|------------------------------------------:|----------:|
| gpt-oss 20B MXFP4 MoE              | pp512  |     12490.92 |                                  12424.74 |      0.99 |
| gpt-oss 20B MXFP4 MoE              | pp1024 |     12705.95 |                                  12729.38 |      1.00 |
| gpt-oss 20B MXFP4 MoE              | pp2048 |     12792.62 |                                  12828.74 |      1.00 |
| gpt-oss 20B MXFP4 MoE              | tg128  |       332.05 |                                    376.31 |      1.13 |
| gpt-oss 20B MXFP4 MoE              | tg256  |       335.49 |                                    375.20 |      1.12 |
| gpt-oss 20B MXFP4 MoE              | tg512  |       352.94 |                                    370.68 |      1.05 |
| llama 3B Q4_K_M                    | pp512  |     21970.62 |                                  21753.85 |      0.99 |
| llama 3B Q4_K_M                    | pp1024 |     21711.02 |                                  21676.37 |      1.00 |
| llama 3B Q4_K_M                    | pp2048 |     20886.10 |                                  20911.59 |      1.00 |
| llama 3B Q4_K_M                    | tg128  |       405.95 |                                    437.33 |      1.08 |
| llama 3B Q4_K_M                    | tg256  |       421.68 |                                    436.90 |      1.04 |
| llama 3B Q4_K_M                    | tg512  |       403.06 |                                    433.63 |      1.08 |
| llama 70B Q4_K_M                   | pp512  |      1247.76 |                                   1262.12 |      1.01 |
| llama 70B Q4_K_M                   | pp1024 |      1255.38 |                                   1249.06 |      0.99 |
| llama 70B Q4_K_M                   | pp2048 |      1237.33 |                                   1232.74 |      1.00 |
| llama 70B Q4_K_M                   | tg128  |        29.85 |                                     29.98 |      1.00 |
| llama 70B Q4_K_M                   | tg256  |        29.58 |                                     29.68 |      1.00 |
| llama 70B Q4_K_M                   | tg512  |        29.21 |                                     29.34 |      1.00 |
| nemotron_h_moe 120B.A12B MXFP4 MoE | pp512  |      2249.21 |                                   2206.37 |      0.98 |
| nemotron_h_moe 120B.A12B MXFP4 MoE | pp1024 |      2240.98 |                                   2201.13 |      0.98 |
| nemotron_h_moe 120B.A12B MXFP4 MoE | pp2048 |      2239.70 |                                   2194.38 |      0.98 |
| nemotron_h_moe 120B.A12B MXFP4 MoE | tg128  |        90.10 |                                     95.97 |      1.07 |
| nemotron_h_moe 120B.A12B MXFP4 MoE | tg256  |        91.42 |                                     96.03 |      1.05 |
| nemotron_h_moe 120B.A12B MXFP4 MoE | tg512  |        91.65 |                                     95.55 |      1.04 |
| nemotron_h_moe 31B.A3.5B Q4_K_M    | pp512  |      7364.98 |                                   7316.76 |      0.99 |
| nemotron_h_moe 31B.A3.5B Q4_K_M    | pp1024 |      7229.96 |                                   7203.64 |      1.00 |
| nemotron_h_moe 31B.A3.5B Q4_K_M    | pp2048 |      7230.63 |                                   7186.50 |      0.99 |
| nemotron_h_moe 31B.A3.5B Q4_K_M    | tg128  |       274.07 |                                    325.74 |      1.19 |
| nemotron_h_moe 31B.A3.5B Q4_K_M    | tg256  |       286.29 |                                    327.51 |      1.14 |
| nemotron_h_moe 31B.A3.5B Q4_K_M    | tg512  |       286.71 |                                    326.74 |      1.14 |
| qwen3 4B Q4_K_M                    | pp512  |     17249.67 |                                  17036.57 |      0.99 |
| qwen3 4B Q4_K_M                    | pp1024 |     16219.11 |                                  16239.87 |      1.00 |
| qwen3 4B Q4_K_M                    | pp2048 |     15760.55 |                                  15732.80 |      1.00 |
| qwen3 4B Q4_K_M                    | tg128  |       295.78 |                                    335.27 |      1.13 |
| qwen3 4B Q4_K_M                    | tg256  |       296.67 |                                    335.11 |      1.13 |
| qwen3 4B Q4_K_M                    | tg512  |       314.09 |                                    332.05 |      1.06 |
| qwen35 27B Q4_K_M                  | pp512  |      2889.48 |                                   2874.02 |      0.99 |
| qwen35 27B Q4_K_M                  | pp1024 |      2858.55 |                                   2857.95 |      1.00 |
| qwen35 27B Q4_K_M                  | pp2048 |      2857.10 |                                   2845.58 |      1.00 |
| qwen35 27B Q4_K_M                  | tg128  |        65.45 |                                     67.58 |      1.03 |
| qwen35 27B Q4_K_M                  | tg256  |        65.92 |                                     67.40 |      1.02 |
| qwen35 27B Q4_K_M                  | tg512  |        65.44 |                                     66.92 |      1.02 |
| qwen35moe 35B.A3B Q4_K_M           | pp512  |      7267.56 |                                   7275.02 |      1.00 |
| qwen35moe 35B.A3B Q4_K_M           | pp1024 |      7173.63 |                                   7221.01 |      1.01 |
| qwen35moe 35B.A3B Q4_K_M           | pp2048 |      7127.54 |                                   7154.39 |      1.00 |
| qwen35moe 35B.A3B Q4_K_M           | tg128  |       191.59 |                                    233.26 |      1.22 |
| qwen35moe 35B.A3B Q4_K_M           | tg256  |       212.29 |                                    234.41 |      1.10 |
| qwen35moe 35B.A3B Q4_K_M           | tg512  |       211.76 |                                    233.37 |      1.10 |
DGX Spark
| Model                              | Test   |   t/s akmaster |   t/s akieslinger/pdl-cuda-lc-experiments |   Speedup |
|:-----------------------------------|:-------|---------------:|------------------------------------------:|----------:|
| gpt-oss 20B MXFP4 MoE              | pp512  |        4102.32 |                                   4242.29 |      1.03 |
| gpt-oss 20B MXFP4 MoE              | pp1024 |        4144.62 |                                   4339.26 |      1.05 |
| gpt-oss 20B MXFP4 MoE              | pp2048 |        4136.31 |                                   4347.89 |      1.05 |
| gpt-oss 20B MXFP4 MoE              | tg128  |          79.53 |                                     84.05 |      1.06 |
| gpt-oss 20B MXFP4 MoE              | tg256  |          79.55 |                                     84.11 |      1.06 |
| gpt-oss 20B MXFP4 MoE              | tg512  |          78.97 |                                     83.55 |      1.06 |
| llama 3B Q4_K_M                    | pp512  |        7441.01 |                                   7372.57 |      0.99 |
| llama 3B Q4_K_M                    | pp1024 |        7344.68 |                                   7405.66 |      1.01 |
| llama 3B Q4_K_M                    | pp2048 |        7226.86 |                                   7340.45 |      1.02 |
| llama 3B Q4_K_M                    | tg128  |          88.49 |                                     90.37 |      1.02 |
| llama 3B Q4_K_M                    | tg256  |          88.42 |                                     90.32 |      1.02 |
| llama 3B Q4_K_M                    | tg512  |          87.71 |                                     89.71 |      1.02 |
| llama 70B Q4_K_M                   | pp512  |         315.86 |                                    316.65 |      1.00 |
| llama 70B Q4_K_M                   | pp1024 |         314.19 |                                    315.18 |      1.00 |
| llama 70B Q4_K_M                   | pp2048 |         311.22 |                                    311.97 |      1.00 |
| llama 70B Q4_K_M                   | tg128  |           4.63 |                                      4.69 |      1.01 |
| llama 70B Q4_K_M                   | tg256  |           4.63 |                                      4.69 |      1.01 |
| llama 70B Q4_K_M                   | tg512  |           4.62 |                                      4.69 |      1.01 |
| nemotron_h_moe 120B.A12B MXFP4 MoE | pp512  |         571.02 |                                    573.89 |      1.01 |
| nemotron_h_moe 120B.A12B MXFP4 MoE | pp1024 |         548.65 |                                    574.55 |      1.05 |
| nemotron_h_moe 120B.A12B MXFP4 MoE | pp2048 |         571.77 |                                    574.15 |      1.00 |
| nemotron_h_moe 120B.A12B MXFP4 MoE | tg128  |          16.51 |                                     16.95 |      1.03 |
| nemotron_h_moe 120B.A12B MXFP4 MoE | tg256  |          16.56 |                                     16.94 |      1.02 |
| nemotron_h_moe 120B.A12B MXFP4 MoE | tg512  |          16.52 |                                     16.89 |      1.02 |
| nemotron_h_moe 31B.A3.5B Q4_K_M    | pp512  |        2188.27 |                                   2233.61 |      1.02 |
| nemotron_h_moe 31B.A3.5B Q4_K_M    | pp1024 |        2213.50 |                                   2255.60 |      1.02 |
| nemotron_h_moe 31B.A3.5B Q4_K_M    | pp2048 |        2221.78 |                                   2245.72 |      1.01 |
| nemotron_h_moe 31B.A3.5B Q4_K_M    | tg128  |          73.50 |                                     76.68 |      1.04 |
| nemotron_h_moe 31B.A3.5B Q4_K_M    | tg256  |          73.75 |                                     76.81 |      1.04 |
| nemotron_h_moe 31B.A3.5B Q4_K_M    | tg512  |          73.57 |                                     76.61 |      1.04 |
| qwen3 4B Q4_K_M                    | pp512  |        5470.71 |                                   5420.62 |      0.99 |
| qwen3 4B Q4_K_M                    | pp1024 |        5304.73 |                                   5413.33 |      1.02 |
| qwen3 4B Q4_K_M                    | pp2048 |        5234.26 |                                   5294.97 |      1.01 |
| qwen3 4B Q4_K_M                    | tg128  |          70.79 |                                     72.88 |      1.03 |
| qwen3 4B Q4_K_M                    | tg256  |          70.75 |                                     72.83 |      1.03 |
| qwen3 4B Q4_K_M                    | tg512  |          70.17 |                                     72.29 |      1.03 |
| qwen35 27B Q4_K_M                  | pp512  |         801.70 |                                    810.28 |      1.01 |
| qwen35 27B Q4_K_M                  | pp1024 |         807.04 |                                    815.69 |      1.01 |
| qwen35 27B Q4_K_M                  | pp2048 |         799.88 |                                    811.95 |      1.02 |
| qwen35 27B Q4_K_M                  | tg128  |          11.23 |                                     11.48 |      1.02 |
| qwen35 27B Q4_K_M                  | tg256  |          11.23 |                                     11.46 |      1.02 |
| qwen35 27B Q4_K_M                  | tg512  |          11.22 |                                     11.46 |      1.02 |
| qwen35moe 35B.A3B Q4_K_M           | pp512  |        2312.35 |                                   2310.93 |      1.00 |
| qwen35moe 35B.A3B Q4_K_M           | pp1024 |        2323.34 |                                   2340.47 |      1.01 |
| qwen35moe 35B.A3B Q4_K_M           | pp2048 |        2346.21 |                                   2329.26 |      0.99 |
| qwen35moe 35B.A3B Q4_K_M           | tg128  |          60.31 |                                     62.98 |      1.04 |
| qwen35moe 35B.A3B Q4_K_M           | tg256  |          60.27 |                                     62.72 |      1.04 |
| qwen35moe 35B.A3B Q4_K_M           | tg512  |          60.04 |                                     62.50 |      1.04 |

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES, for small autocompletes and inquiries about the code base. Every diff was manually modified, checked and tested by me before adding it to a commit.

aendk added 30 commits February 4, 2026 15:39
…t input pointer access, and "launch" after last write, e.g. to tensors like dst.
ggml-gh-bot (Bot) commented Apr 29, 2026

Hi @aendk, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 2 open PRs.

  • Large PR: Large changes require prior discussion (e.g. an issue or RFC) and maintainers may not be able to review this PR as-is. Consider splitting it into smaller, focused PRs.


Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@github-actions (Bot) added the labels "Nvidia GPU" (Issues specific to Nvidia GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning) on Apr 29, 2026
@ORippler (Collaborator) left a comment


Given this opens up the possibility for hard-to-catch data races, I feel we should make this toggleable at run time rather than a compile-time feature, to facilitate easier debugging and a guaranteed functionally correct path should a bug ever occur. Also, isn't PDL effectively a no-op on CC < 9.0 devices? If so, we can simply always compile it and rely on the run-time toggle (i.e. remove the cmake flag).

Also, please clean up leftover comments before marking this as ready to review.

#define GGML_CUDA_CC_TURING 750
#define GGML_CUDA_CC_AMPERE 800
#define GGML_CUDA_CC_ADA_LOVELACE 890
#define GGML_CUDA_CC_HOPPER 900
Collaborator


This seems unused?

Comment on lines +113 to +114
# define GGML_CUDA_PDL_SYNC() cudaGridDependencySynchronize()
# define GGML_CUDA_PDL_LC() cudaTriggerProgrammaticLaunchCompletion()
Collaborator


For transpilation we need to add the corresponding aliases for Musa/Hip (or guard this to be CUDA-only for now if these aliases are absent)
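A possible shape of the CUDA-only guard suggested here (a sketch; whether the PR already wraps these definitions in a GGML_CUDA_PDL check is an assumption on my part):

```cpp
// Sketch: expand to the real PDL intrinsics only for CUDA builds that enable the
// feature; make the macros no-ops for HIP/MUSA (and when PDL is disabled).
#if defined(GGML_CUDA_PDL) && !defined(GGML_USE_HIP) && !defined(GGML_USE_MUSA)
#    define GGML_CUDA_PDL_SYNC() cudaGridDependencySynchronize()
#    define GGML_CUDA_PDL_LC()   cudaTriggerProgrammaticLaunchCompletion()
#else
#    define GGML_CUDA_PDL_SYNC() do {} while (0)
#    define GGML_CUDA_PDL_LC()   do {} while (0)
#endif
```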

Comment on lines +141 to +142
GGML_CUDA_PDL_SYNC();
// GGML_CUDA_PDL_LC(); // FATTN_VEC try 2; on maxq
Collaborator


Please clean-up

const uint3 neqk1_magic,
const uint3 rq3_magic,
float scale) {
// GGML_CUDA_PDL_LC(); // GATED_DELTA_NET try 1; always followed by memcpy on qwen3.5, no benefit
Collaborator


Suggested change
// GGML_CUDA_PDL_LC(); // GATED_DELTA_NET try 1; always followed by memcpy on qwen3.5, no benefit

Comment on lines +95 to +98
constexpr int experts_per_thread = (n_experts > WARP_SIZE) ? n_experts / WARP_SIZE : 1;
float wt[experts_per_thread];
float wt_sum = 0.f;
float output_weights[experts_per_thread];
Collaborator


Those are per se not data-accesses, so there is no need to move them. Did you see actual perf gains for this?

